SYS 6018 | Spring 2022 | University of Virginia
During the 2010 Earthquake in Haiti, many people lost their homes. These homeless people generally live under blue TARP. Our task was to train the model to quickly locate these homeless people in aerial images, in other words to locate blue Tarp.When we know where the homeless are, we can deploy our resources more effectively. More accurate estimates of the number of people affected. Aerial images are RGB three-channel images, and each sample is a pixel point. In the data, the classification of each sample has been marked, so I use supervised learning methods.
Load data, explore data, etc.
# Load Required Packages
library(tidyverse)
library(glmnet)
library(yardstick)
library(broom)
library(plotly)
library(nnet)
library(e1071) # functions for svm
library(pROC) # functions for AUROC
library(naivebayes) # package for naivebayes
library(caret)
library(MASS) # package for lda and qda
library(purrr) # package for thresholds searching
library(randomForest) # package for randomforest
library(gbm) # package for gbm
#-- Load Data and add binary outcome
data = read_csv('HaitiTraining.csv') %>%
mutate(bluetarp = ifelse(Class == 'Blue Tarp', 1L, 0L) %>% factor())
#-- EDA & Explore data
head(data,n=50)#> # A tibble: 50 × 5
#> Class Red Green Blue bluetarp
#> <chr> <dbl> <dbl> <dbl> <fct>
#> 1 Vegetation 64 67 50 0
#> 2 Vegetation 64 67 50 0
#> 3 Vegetation 64 66 49 0
#> 4 Vegetation 75 82 53 0
#> 5 Vegetation 74 82 54 0
#> 6 Vegetation 72 76 52 0
#> 7 Vegetation 71 72 51 0
#> 8 Vegetation 69 70 49 0
#> 9 Vegetation 68 70 49 0
#> 10 Vegetation 67 70 50 0
#> # … with 40 more rows
ggplot(data) +
geom_bar(mapping = aes(x = Class))plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$Class) %>% layout(autosize = F, width = 500, height = 500)#-- The ratio of bluetarp and non-bluetarp
ggplot(data) +
geom_bar(mapping = aes(x = bluetarp))plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$bluetarp) %>% layout(autosize = F, width = 500, height = 500)#-- Make function to calculate confusion matrix metrics
get_conf <- function(thres, y, p_hat) {
yhat = ifelse(p_hat >= thres, 1L, 0L)
tibble(threshold = thres,
TP = sum(y == 1 & yhat == 1),
FP = sum(y == 0 & yhat == 1),
FN = sum(y == 1 & yhat == 0),
TN = sum(y == 0 & yhat == 0)
)
}set.seed(2021)
nfolds = 10 # number of folds
n= nrow(data) # number of observations in training dataset
folds = sample(rep(1:nfolds, length = n))
train_control <- trainControl(
method = "cv",
number = nfolds
)There are three hyperparameters in naive bayes model usekernel: parameter allows us to use a kernel density estimate for continuous variables versus a guassian density estimate, adjust: allows us to adjust the bandwidth of the kernel density (larger numbers mean more flexible density estimate), fL: allows us to incorporate the Laplace smoother.
# set up tuning grid
search_grid <- expand.grid(
usekernel = c(TRUE, FALSE),
fL = 0:5,
adjust = seq(0, 5, by = 1)
)
# train model
model <- train(
x = data[c("Red","Green","Blue")],
y = data$bluetarp,
method = "nb",
trControl = train_control,
tuneGrid = search_grid
)
# top 5 modesl
model$results %>%
top_n(50, wt = Accuracy) %>%
arrange(desc(Accuracy))#> usekernel fL adjust Accuracy Kappa AccuracySD KappaSD
#> 1 TRUE 0 2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 2 TRUE 1 2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 3 TRUE 2 2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 4 TRUE 3 2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 5 TRUE 4 2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 6 TRUE 5 2 0.9818156 0.5968692 0.0009338387 0.03017821
#> 7 TRUE 0 3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 8 TRUE 1 3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 9 TRUE 2 3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 10 TRUE 3 3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 11 TRUE 4 3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 12 TRUE 5 3 0.9797916 0.5292260 0.0009882855 0.03304920
#> 13 TRUE 0 1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 14 TRUE 1 1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 15 TRUE 2 1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 16 TRUE 3 1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 17 TRUE 4 1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 18 TRUE 5 1 0.9770561 0.5704484 0.0018426201 0.03219958
#> 19 TRUE 0 4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 20 TRUE 1 4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 21 TRUE 2 4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 22 TRUE 3 4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 23 TRUE 4 4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 24 TRUE 5 4 0.9747157 0.3372781 0.0012105118 0.05122912
#> 25 FALSE 0 0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 26 FALSE 0 1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 27 FALSE 0 2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 28 FALSE 0 3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 29 FALSE 0 4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 30 FALSE 0 5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 31 FALSE 1 0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 32 FALSE 1 1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 33 FALSE 1 2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 34 FALSE 1 3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 35 FALSE 1 4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 36 FALSE 1 5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 37 FALSE 2 0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 38 FALSE 2 1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 39 FALSE 2 2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 40 FALSE 2 3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 41 FALSE 2 4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 42 FALSE 2 5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 43 FALSE 3 0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 44 FALSE 3 1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 45 FALSE 3 2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 46 FALSE 3 3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 47 FALSE 3 4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 48 FALSE 3 5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 49 FALSE 4 0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 50 FALSE 4 1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 51 FALSE 4 2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 52 FALSE 4 3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 53 FALSE 4 4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 54 FALSE 4 5 0.9707152 0.1502549 0.0006506225 0.03567168
#> 55 FALSE 5 0 0.9707152 0.1502549 0.0006506225 0.03567168
#> 56 FALSE 5 1 0.9707152 0.1502549 0.0006506225 0.03567168
#> 57 FALSE 5 2 0.9707152 0.1502549 0.0006506225 0.03567168
#> 58 FALSE 5 3 0.9707152 0.1502549 0.0006506225 0.03567168
#> 59 FALSE 5 4 0.9707152 0.1502549 0.0006506225 0.03567168
#> 60 FALSE 5 5 0.9707152 0.1502549 0.0006506225 0.03567168
# plot search grid results
plot(model)Naive_Bayes_Model <- modelNB_Prediction = predict(Naive_Bayes_Model,newdata= data[c("Red","Green","Blue")])
nb.cm = table(NB_Prediction, data$bluetarp)
nb.cm#>
#> NB_Prediction 0 1
#> 0 61200 1139
#> 1 19 883
FP = nb.cm[1,2]
FN = nb.cm[2,1]
TP = nb.cm[2,2]
TN = nb.cm[1,1]
#get the accuracy precision FPR and FPR
nb.acc = (TP + TN)/(FP + FN + TP + TN)
nb.pre = TP/(TP + FP)
nb.tpr = TP/(TP+FN)
nb.fpr = FP/(FP + TN)
nb.auc = auc(as.numeric(data$bluetarp), as.numeric(NB_Prediction))
nb.pre = NB_PredictionLDA_Model <- lda(bluetarp ~ Red + Green + Blue, data = data)
LDA_Prediction = predict(LDA_Model,newdata= data[c("Red","Green","Blue")])
plot(LDA_Model)LDA_yhat <- LDA_Prediction$class
lda.cm = table(LDA_yhat, data$bluetarp)
print(lda.cm)#>
#> LDA_yhat 0 1
#> 0 60607 402
#> 1 612 1620
FP = lda.cm[1,2]
FN = lda.cm[2,1]
TP = lda.cm[2,2]
TN = lda.cm[1,1]
#get the accuracy precision FPR and FPR
lda.acc = (TP + TN)/(FP + FN + TP + TN)
lda.pre = TP/(TP + FP)
lda.tpr = TP/(TP+FN)
lda.fpr = FP/(FP + TN)
lda.auc = auc(as.numeric(data$bluetarp), as.numeric(LDA_Prediction$class))QDA_Model <- qda(bluetarp ~ Red + Green + Blue, data = data)
QDA_Prediction = predict(QDA_Model,newdata= data[c("Red","Green","Blue")])
QDA_yhat <- QDA_Prediction$class
qda.cm = table(QDA_yhat, data$bluetarp)
print(qda.cm)#>
#> QDA_yhat 0 1
#> 0 61201 323
#> 1 18 1699
FP = qda.cm[1,2]
FN = qda.cm[2,1]
TP = qda.cm[2,2]
TN = qda.cm[1,1]
#get the accuracy precision FPR and FPR
qda.acc = (TP + TN)/(FP + FN + TP + TN)
qda.pre = TP/(TP + FP)
qda.tpr = TP/(TP+FN)
qda.fpr = FP/(FP + TN)
qda.auc = auc(as.numeric(data$bluetarp), as.numeric(LDA_Prediction$class))First I’m going to train the KNN model. KNN(k nearest neighbor) is one of the simplest machine learning method. For each data point that needs to be predicted, KNN will look for k sample points closest to it and obtain the classification of prediction points by counting the class of K neighbor sample points. The distance between different sample is depand on their euclidean distance. K is the tuning parameter in this model.It determines the degree of freedom of the model, and as K increases the degree of freedom of the model decreases. ### Tuning Parameter \(k\)
trControl <- trainControl(method = "cv",
number = 5)
KNN_Model <- train(bluetarp ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:20),
trControl = trControl,
metric = "Accuracy",
data = data[c("Red","Green","Blue","bluetarp")])
print(KNN_Model)#> k-Nearest Neighbors
#>
#> 63241 samples
#> 3 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 50593, 50593, 50592, 50592, 50594
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 1 0.9967426 0.9470160
#> 2 0.9969482 0.9506523
#> 3 0.9971063 0.9532870
#> 4 0.9970430 0.9523247
#> 5 0.9972486 0.9556651
#> 6 0.9971221 0.9536726
#> 7 0.9971221 0.9537533
#> 8 0.9969798 0.9514218
#> 9 0.9970114 0.9519205
#> 10 0.9970589 0.9526353
#> 11 0.9971221 0.9535729
#> 12 0.9971063 0.9533450
#> 13 0.9971537 0.9541309
#> 14 0.9968849 0.9498317
#> 15 0.9969640 0.9509849
#> 16 0.9968849 0.9497238
#> 17 0.9969640 0.9510128
#> 18 0.9969798 0.9512271
#> 19 0.9969324 0.9505657
#> 20 0.9969324 0.9505566
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 5.
plot(KNN_Model)k = 5 is the best parameters for this knn model.
KNN_Model <- FNN::knn(train = scale(data[c("Red","Green","Blue")]),
test = scale(data[c("Red","Green","Blue")]),
cl = data$bluetarp,
k = 5,
prob = TRUE)
##create confusion matrix
knn.cm <- table(KNN_Model, data$bluetarp)
print(knn.cm)#>
#> KNN_Model 0 1
#> 0 61150 60
#> 1 69 1962
FP = knn.cm[1,2]
FN = knn.cm[2,1]
TP = knn.cm[2,2]
TN = knn.cm[1,1]
#get the accuracy precision FPR and FPR
knn.acc = (TP + TN)/(FP + FN + TP + TN)
knn.pre = TP/(TP + FP)
knn.tpr = TP/(TP+FN)
knn.fpr = FP/(FP + TN)
knn.auc = auc(as.numeric(data$bluetarp), as.numeric(KNN_Model))#- Set tuning parameter grid search
grid = expand_grid(
kernel = "radial",
gamma = seq(1/8, 2, length=4),
cost = 10^seq(-2,5,length=4)
)
#- Iterate over folds
perf = tibble() # to save performance metric
for(i in 1:3){ # NOTE: using 3 folds
#-- Set training/test data for fold i
test = which(folds == i) # indices of holdout/validation data
train = which(folds != i) # indices of fitting/training data
n.val = length(test) # number of observations in validation
#-- loop over tuning grid
for(j in 1:nrow(grid)){
tpars = grid[j,]
#: fit model (i.e., estimate model parameters)
fit = e1071::svm(bluetarp ~ Red + Green + Blue,
data = data[train,],
type = "C-classification",
probability=TRUE, # enable probability output
#: tuning parameters
scale = TRUE,
kernel = tpars$kernel,
gamma = tpars$gamma,
cost = tpars$cost)
#: estimate probability in hold out set
p_hat = predict(fit, data[test,], probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]
#: evaluate performance
log_loss = yardstick::mn_log_loss_vec(data$bluetarp[test], p_hat,
event_level = "second")
AUROC = yardstick::roc_auc_vec(data$bluetarp[test], p_hat,
event_level = "second")
eval = tibble(log_loss, AUROC) %>%
mutate(fold = i, tuning = j,
n.val,
kernel=tpars$kernel,
gamma=tpars$gamma,
cost=tpars$cost)
#: update results
perf = bind_rows(perf, eval)
}
}
perf#> # A tibble: 48 × 8
#> log_loss AUROC fold tuning n.val kernel gamma cost
#> <dbl> <dbl> <int> <int> <int> <chr> <dbl> <dbl>
#> 1 0.0229 0.997 1 1 6325 radial 0.125 0.01
#> 2 0.0123 1.00 1 2 6325 radial 0.125 2.15
#> 3 0.00888 1.00 1 3 6325 radial 0.125 464.
#> 4 0.00733 1.00 1 4 6325 radial 0.125 100000
#> 5 0.0213 0.999 1 5 6325 radial 0.75 0.01
#> 6 0.0103 1.00 1 6 6325 radial 0.75 2.15
#> 7 0.00751 1.00 1 7 6325 radial 0.75 464.
#> 8 0.00847 1.00 1 8 6325 radial 0.75 100000
#> 9 0.0220 0.999 1 9 6325 radial 1.38 0.01
#> 10 0.0100 1.00 1 10 6325 radial 1.38 2.15
#> # … with 38 more rows
(avg_perf = perf %>%
group_by(tuning, kernel, gamma, cost) %>%
summarize(
avg_log_loss = mean(log_loss),
avg_AUROC = mean(AUROC),
sd_log_loss = sd(log_loss),
sd_AUROC = sd(AUROC)
) %>%
arrange(avg_log_loss, -avg_AUROC))#> # A tibble: 16 × 8
#> # Groups: tuning, kernel, gamma [16]
#> tuning kernel gamma cost avg_log_loss avg_AUROC sd_log_loss sd_AUROC
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 15 radial 2 464. 0.00652 1.00 0.000831 0.0000583
#> 2 7 radial 0.75 464. 0.00680 1.00 0.000977 0.0000532
#> 3 11 radial 1.38 464. 0.00683 1.00 0.000863 0.0000710
#> 4 4 radial 0.125 100000 0.00694 1.00 0.000744 0.0000423
#> 5 3 radial 0.125 464. 0.00777 1.00 0.00122 0.0000801
#> 6 8 radial 0.75 100000 0.00813 1.00 0.000328 0.0000996
#> 7 12 radial 1.38 100000 0.00847 1.00 0.00108 0.0000729
#> 8 10 radial 1.38 2.15 0.00877 1.00 0.00151 0.0000874
#> 9 14 radial 2 2.15 0.00895 1.00 0.00161 0.0000923
#> 10 6 radial 0.75 2.15 0.00898 1.00 0.00160 0.0000979
#> 11 16 radial 2 100000 0.00984 1.00 0.000549 0.000192
#> 12 2 radial 0.125 2.15 0.0103 1.00 0.00239 0.000115
#> 13 5 radial 0.75 0.01 0.0179 0.999 0.00467 0.000264
#> 14 13 radial 2 0.01 0.0180 0.999 0.00557 0.000287
#> 15 9 radial 1.38 0.01 0.0183 0.999 0.00541 0.000304
#> 16 1 radial 0.125 0.01 0.0206 0.997 0.00345 0.000517
ggplot(avg_perf, aes(gamma, cost, fill=avg_log_loss)) +
geom_tile() + scale_y_log10() +
scale_fill_distiller()tune_svm = list(cost = 400, gamma = 2, kernel = "radial")pred = numeric(nrow(data))
for(v in unique(folds)) {
# set fit/eval split
ind_fit = which(folds != v)
ind_eval = which(folds == v)
# fit SVM (for specific tuning parameters)
fit= e1071::svm(bluetarp ~ Red + Green + Blue,
data = data[ind_fit,],
type = "C-classification",
probability=TRUE, # enable probability output
#: tuning parameters
scale = TRUE,
kernel = tune_svm$kernel,
gamma = tune_svm$gamma,
cost = tune_svm$cost)
# estimate probability in hold out set
pred[ind_eval] = predict(fit, data[ind_eval,], probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]
}
eval_data = tibble(
y = data$bluetarp,
pred = pred,
yhat = ifelse(pred > .50, 1, 0) # hard classification
)
yhat = ifelse(pred > .50, 1, 0) # hard classification
SVM.predict = yhat
# ROC curves
ROC_data = yardstick::roc_curve(eval_data,
truth = y,
estimate = pred,
event_level = "second")
autoplot(ROC_data)svm.cm <- table(SVM.predict, data$bluetarp)
print(svm.cm)#>
#> SVM.predict 0 1
#> 0 61145 97
#> 1 74 1925
FP = svm.cm[1,2]
FN = svm.cm[2,1]
TP = svm.cm[2,2]
TN = svm.cm[1,1]
#get the accuracy precision FPR and FPR
svm.acc = (TP + TN)/(FP + FN + TP + TN)
svm.pre = TP/(TP + FP)
svm.tpr = TP/(TP+FN)
svm.fpr = FP/(FP + TN)
svm.auc = auc(as.numeric(data$bluetarp), as.numeric(RF_Prediction))#> Error in roc.default(response, predictor, auc = TRUE, ...): 找不到对象'RF_Prediction'
#-- Settings
ntree_max = 200 # maximum number of trees in forest
mtry_seq = seq(1, 3 , by=1) # grid of tuning parameters
#-- cv
ACC = tibble() # initiate results df
#-- Run cross-validation
sumTest <- c()
sumYhat <- c()
for(i in 1:length(mtry_seq)) {
acc <- c()
rf = randomForest(bluetarp ~ ., data= data[c("Red","Green","Blue","bluetarp")]
, ntree = ntree_max, mtry = mtry_seq[i], cv.folds = 10)
out = tibble(mtry = mtry_seq[i],
ntree = 1:ntree_max,
acc = rf$err.rate)
ACC = bind_rows(ACC, out)
}
#-- Aggregate Results
ACC_agg = ACC %>%
group_by(mtry, ntree) %>% summarize(acc = 1 - mean(acc)) %>% ungroup()
ACC_agg %>%
mutate(mtry = factor(mtry)) %>% # make mtry factor for plotting
ggplot(aes(ntree, acc, color=mtry)) + geom_line() +
coord_cartesian(ylim=c(0.95, 0.999))RF_Tuning <- subset(ACC_agg, acc == max(acc))
print(RF_Tuning)#> # A tibble: 1 × 3
#> mtry ntree acc
#> <dbl> <int> <dbl>
#> 1 2 57 0.982
optNtree = RF_Tuning$ntree
optMtry = RF_Tuning$mtry
RF_Model = randomForest(bluetarp ~ ., data= data[c("Red","Green","Blue","bluetarp")]
, ntree = optNtree, mtry = optMtry, cv.folds = 10)
RF_Prediction = predict(RF_Model,data = data[c("Red","Green","Blue")])
rf.cm <- table(RF_Prediction, data$bluetarp)
print(rf.cm)#>
#> RF_Prediction 0 1
#> 0 61138 104
#> 1 81 1918
FP = rf.cm[1,2]
FN = rf.cm[2,1]
TP = rf.cm[2,2]
TN = rf.cm[1,1]
#get the accuracy precision FPR and FPR
rf.acc = (TP + TN)/(FP + FN + TP + TN)
rf.pre = TP/(TP + FP)
rf.tpr = TP/(TP+FN)
rf.fpr = FP/(FP + TN)
rf.auc = auc(as.numeric(data$bluetarp), as.numeric(RF_Prediction))#-- Settings
ntree_max = 200 # maximum number of trees in forest
depth.seq = seq(1, 10 , by=1) # grid of tuning parameters
ACC.Boost = tibble() # initiate results df
phat<-c()
K = 10
#-- k fold cv
for(k in 1:K) {
test = (folds == k)
train = (folds != k)
for(i in 1:length(depth.seq)) {
model.boost = gbm(as.character(bluetarp)~., data=data[train,c("Red","Green","Blue","bluetarp")], distribution ="bernoulli", n.trees=ntree_max, interaction.depth = depth.seq[i])
mse <- c()
for(j in 1:ntree_max){
phat <- c()
phat <- predict(model.boost, newdata = data[test,c("Red","Green","Blue")], n.trees = j,type="response")
#yhat = ifelse(phat >= 0.2, 1L, 0L)
tmse = sum((as.numeric(phat) - as.numeric(data[test,]$bluetarp))^2)/length(phat)
#get the mse of each ntree and fold
mse = c(mse,tmse)
}
t <- model.boost$train.error
out = tibble(depth = depth.seq[i],
ntree = 1:ntree_max,
mse = mse,
iter = k)
ACC.Boost = bind_rows(ACC.Boost, out)
}
}
#-- Aggregate Results
ACC_agg.boost = ACC.Boost %>%
group_by(depth, ntree) %>% summarize(mse = mean(mse)) %>% ungroup()
#-- Plot Results
ACC_agg.boost %>%
mutate(depth = factor(depth)) %>% # make depth factor for plotting
ggplot(aes(ntree, mse, color=depth)) + geom_line()BT_Tuning <- subset(ACC_agg.boost, mse == min(mse))
print(BT_Tuning)#> # A tibble: 1 × 3
#> depth ntree mse
#> <dbl> <int> <dbl>
#> 1 10 5 0.998
optDepth = BT_Tuning$depth
optNtree = BT_Tuning$ntreeBoost_Model = gbm(as.character(bluetarp)~., data=data[c("Red","Green","Blue","bluetarp")], distribution ="bernoulli", n.trees=optNtree, interaction.depth = optDepth)
Boost.predict = predict(Boost_Model,data, type="response")#-- Calculate for range of thresholds
thres = seq(0.01, 1, length=20) # set of thresholds
perf_thres = map_df(thres, ~get_conf(., y=data$bluetarp, p_hat = Boost.predict)) %>%
mutate(cost = FN*1 + FP*1)
perf_thres#> # A tibble: 20 × 6
#> threshold TP FP FN TN cost
#> <dbl> <int> <int> <int> <int> <dbl>
#> 1 0.01 2022 61219 0 0 61219
#> 2 0.0621 1942 260 80 60959 340
#> 3 0.114 1897 232 125 60987 357
#> 4 0.166 1872 227 150 60992 377
#> 5 0.218 1861 164 161 61055 325
#> 6 0.271 1844 134 178 61085 312
#> 7 0.323 1788 87 234 61132 321
#> 8 0.375 1756 84 266 61135 350
#> 9 0.427 1619 52 403 61167 455
#> 10 0.479 1478 24 544 61195 568
#> 11 0.531 1362 11 660 61208 671
#> 12 0.583 1305 6 717 61213 723
#> 13 0.635 111 0 1911 61219 1911
#> 14 0.687 28 0 1994 61219 1994
#> 15 0.739 7 0 2015 61219 2015
#> 16 0.792 0 0 2022 61219 2022
#> 17 0.844 0 0 2022 61219 2022
#> 18 0.896 0 0 2022 61219 2022
#> 19 0.948 0 0 2022 61219 2022
#> 20 1 0 0 2022 61219 2022
#-- Make cost curve plot
perf_thres %>%
ggplot(aes(threshold, cost)) +
geom_line() +
geom_point() + geom_point(data = . %>% slice_min(cost), color="red", size=4) +
geom_vline(xintercept = 1/6, color="red") +
scale_x_continuous(breaks = seq(0, 1, by=.02))Boost_opt_threshold = 0.1yhat = ifelse(Boost.predict >= Boost_opt_threshold, 1L, 0L)
bt.cm <- table(yhat, data$bluetarp)
print(bt.cm)#>
#> yhat 0 1
#> 0 60987 122
#> 1 232 1900
FP = bt.cm[1,2]
FN = bt.cm[2,1]
TP = bt.cm[2,2]
TN = bt.cm[1,1]
#get the accuracy precision FPR and FPR
bt.acc = (TP + TN)/(FP + FN + TP + TN)
bt.pre = TP/(TP + FP)
bt.tpr = TP/(TP+FN)
bt.fpr = FP/(FP + TN)
bt.auc = auc(as.numeric(data$bluetarp), as.numeric(yhat))In this part I refer to the answers in the homework.The elasticnet models have two tuning parameters, α and λ. Once α is selected, then the glmnet() function can efficiently fit the models for a set of λ values. ### Tuning Parameters
#-- Initialize
alpha_seq = c(0, .25, .5, .75, 1) # set of alpha values to search
perf_alpha = tibble()
#-- Search over alpha (and lambda)
for(i in seq_along(alpha_seq)){
a = alpha_seq[i]
fit = cv.glmnet(x = as.matrix(data[c("Red","Green","Blue")]), y = data$bluetarp, family="binomial", alpha = a,
foldid = folds, # use same folds for all alpha
type.measure = "deviance")
perf = broom::tidy(fit) %>% slice_min(estimate) %>% mutate(alpha = a)
perf_alpha = bind_rows(perf_alpha, perf)
}
perf_alpha %>%
ggplot(aes(alpha, estimate)) + geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.05)perf_alpha %>% arrange(estimate) %>%
mutate_at(vars(estimate:conf.high), round, digits=3)#> # A tibble: 5 × 7
#> lambda estimate std.error conf.low conf.high nzero alpha
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 0.00000550 0.028 0.001 0.027 0.029 3 1
#> 2 0.00000555 0.028 0.001 0.027 0.029 3 0.75
#> 3 0.00000832 0.029 0.001 0.028 0.03 3 0.5
#> 4 0.0000166 0.032 0.001 0.031 0.033 3 0.25
#> 5 0.00416 0.132 0.002 0.13 0.134 3 0
tune_enet = perf_alpha %>%
slice_min(estimate)
tune_enet#> # A tibble: 1 × 7
#> lambda estimate std.error conf.low conf.high nzero alpha
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 0.00000550 0.0281 0.00113 0.0270 0.0293 3 1
Enet_Model = cv.glmnet(x = as.matrix(data[c("Red","Green","Blue")]), y = data$bluetarp, family="binomial", alpha = tune_enet$alpha,
foldid = folds, # use same folds for all alpha
type.measure = "deviance")
Enet.predict = predict(Enet_Model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
#-- Calculate for range of thresholds
thres = seq(0.01, 1, length=20) # set of thresholds
perf_thres = map_df(thres, ~get_conf(., y=data$bluetarp, p_hat = Enet.predict)) %>%
mutate(cost = FN*1 + FP*1)
perf_thres#> # A tibble: 20 × 6
#> threshold TP FP FN TN cost
#> <dbl> <int> <int> <int> <int> <dbl>
#> 1 0.01 2015 3857 7 57362 3864
#> 2 0.0621 1973 659 49 60560 708
#> 3 0.114 1932 481 90 60738 571
#> 4 0.166 1899 435 123 60784 558
#> 5 0.218 1866 387 156 60832 543
#> 6 0.271 1823 83 199 61136 282
#> 7 0.323 1809 76 213 61143 289
#> 8 0.375 1789 65 233 61154 298
#> 9 0.427 1776 59 246 61160 305
#> 10 0.479 1767 47 255 61172 302
#> 11 0.531 1754 42 268 61177 310
#> 12 0.583 1739 41 283 61178 324
#> 13 0.635 1729 36 293 61183 329
#> 14 0.687 1714 30 308 61189 338
#> 15 0.739 1693 25 329 61194 354
#> 16 0.792 1675 23 347 61196 370
#> 17 0.844 1640 19 382 61200 401
#> 18 0.896 1595 14 427 61205 441
#> 19 0.948 1524 6 498 61213 504
#> 20 1 0 0 2022 61219 2022
#-- Make cost curve plot
perf_thres %>%
ggplot(aes(threshold, cost)) +
geom_line() +
geom_point() + geom_point(data = . %>% slice_min(cost), color="red", size=4) +
scale_x_continuous(breaks = seq(0, 1, by=.02))Enet.optThreshold = 0.27052632
#evaluation
yhat = ifelse(Enet.predict >= Enet.optThreshold, 1L, 0L)
Enet.cm <- table(yhat, data$bluetarp)
print(Enet.cm)#>
#> yhat 0 1
#> 0 61136 199
#> 1 83 1823
FP = Enet.cm[1,2]
FN = Enet.cm[2,1]
TP = Enet.cm[2,2]
TN = Enet.cm[1,1]
#get the accuracy precision FPR and FPR
Enet.acc = (TP + TN)/(FP + FN + TP + TN)
Enet.pre = TP/(TP + FP)
Enet.tpr = TP/(TP+FN)
Enet.fpr = FP/(FP + TN)
Enet.auc = auc(as.numeric(data$bluetarp), as.numeric(yhat))LR_model = cv.glmnet(x = as.matrix(data[c("Red","Green","Blue")]), y = data$bluetarp, family="binomial", alpha = 0,
foldid = folds, # use same folds for all alpha
type.measure = "deviance")LR.predict = predict(LR_model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
#-- Calculate for range of thresholds
thres = seq(0.01, 1, length=20) # set of thresholds
perf_thres = map_df(thres, ~get_conf(., y=data$bluetarp, p_hat = LR.predict)) %>%
mutate(cost = FN*1 + FP*1)
perf_thres#> # A tibble: 20 × 6
#> threshold TP FP FN TN cost
#> <dbl> <int> <int> <int> <int> <dbl>
#> 1 0.01 2022 52350 0 8869 52350
#> 2 0.0621 1769 2329 253 58890 2582
#> 3 0.114 1606 779 416 60440 1195
#> 4 0.166 1450 517 572 60702 1089
#> 5 0.218 1288 359 734 60860 1093
#> 6 0.271 1129 0 893 61219 893
#> 7 0.323 1023 0 999 61219 999
#> 8 0.375 935 0 1087 61219 1087
#> 9 0.427 834 0 1188 61219 1188
#> 10 0.479 691 0 1331 61219 1331
#> 11 0.531 505 0 1517 61219 1517
#> 12 0.583 336 0 1686 61219 1686
#> 13 0.635 204 0 1818 61219 1818
#> 14 0.687 104 0 1918 61219 1918
#> 15 0.739 50 0 1972 61219 1972
#> 16 0.792 4 0 2018 61219 2018
#> 17 0.844 0 0 2022 61219 2022
#> 18 0.896 0 0 2022 61219 2022
#> 19 0.948 0 0 2022 61219 2022
#> 20 1 0 0 2022 61219 2022
#-- Make cost curve plot
perf_thres %>%
ggplot(aes(threshold, cost)) +
geom_line() +
geom_point() + geom_point(data = . %>% slice_min(cost), color="red", size=4) +
scale_x_continuous(breaks = seq(0, 1, by=.02))LR.optThreshold = 0.27052632yhat = ifelse(LR.predict >= LR.optThreshold, 1L, 0L)
lr.cm <- table(yhat, data$bluetarp)
print(lr.cm)#>
#> yhat 0 1
#> 0 61219 893
#> 1 0 1129
FP = lr.cm[1,2]
FN = lr.cm[2,1]
TP = lr.cm[2,2]
TN = lr.cm[1,1]
#get the accuracy precision FPR and FPR
lr.acc = (TP + TN)/(FP + FN + TP + TN)
lr.pre = TP/(TP + FP)
lr.tpr = TP/(TP+FN)
lr.fpr = FP/(FP + TN)
lr.auc = auc(as.numeric(data$bluetarp), as.numeric(yhat))Threshold could be discussed here or within each model section above.
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Random Forest | Ntree = 40,mtry = 3 | 0.97 | NULL | 0.9966 | 0.9623 | 0.0017 | 0.9433 |
| Boosted Tree | Ntree = 4, depth = 10 | 0.954 | 0.1 | 0.994 | 0.917 | 0.0029 | 0.9099 |
| SVM | cost = 400, gamma = 2, kernal = radial | 0.972 | 0.1 | 0.997 | 0.9625 | 0.0015 | 0.9530 |
| Naive Bayes | Nonparametric Bandwidth = 2 | 0.718 | NULL | 0.977 | 0.691 | 0.016 | 0.5103858 |
| LDA | NULL | 0.896 | NULL(0.5) | 0.983 | 0.725 | 0.006 | 0.801 |
| QDA | NULL | 0.896 | NULL(0.5) | 0.995 | 0.990 | 0.005 | 0.840 |
| Penalized Log Reg | Lambd =0.0000055, alpha = 1 | 0.95 | 0.27052632 | 0.995 | 0.956 | 0.003 | 0.901 |
| Logistic Regression | Lambd =0.00416 | 0.779 | 0.2705 | 0.986 | 1 | 0.014 | 0.558 |
| KNN | k = 5 | 0.985 | NULL | 0.997 | 0.966 | 0.00098 | 0.970 |
#: get predictions
train_prediction.RF = predict(RF_Model,data = data[c("Red","Green","Blue")])
pred.BT = predict(Boost_Model,data, type="response")
train_prediction.BT = ifelse(pred.BT >= Boost_opt_threshold, 1L, 0L)
pred[ind_eval] = predict(fit, data[ind_eval,], probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]#> Error in predict.glmnet(object$glmnet.fit, newx, s = lambda, ...): The number of variables in newx must be 3
yhat = ifelse(pred > .50, 1, 0) # hard classification
train_prediction.SVM = yhat
train_prediction.NB = predict(Naive_Bayes_Model,newdata= data[c("Red","Green","Blue")])
LDA_Prediction = predict(LDA_Model,newdata= data[c("Red","Green","Blue")])
train_prediction.LDA <- LDA_Prediction$class
QDA_Prediction = predict(QDA_Model,newdata= data[c("Red","Green","Blue")])
train_prediction.QDA<- QDA_Prediction$class
Enet.predict <-predict(Enet_Model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
train_prediction.Enet = ifelse(Enet.predict >= Enet.optThreshold, 1L, 0L)
LR.predict = predict(LR_model,newx = as.matrix(data[c("Red","Green","Blue")]), type="response")
train_prediction.LR= ifelse(LR.predict >= LR.optThreshold, 1L, 0L)
pred_data =
tibble(
enet = as.numeric(unlist(Enet.predict)),
logistic = as.numeric(unlist(LR.predict)),
linked = data$bluetarp %>% factor(levels=c(1,0)),
knn = as.numeric(unlist(KNN_Model)),
svm = as.numeric(pred),
random_forest = as.numeric(unlist(train_prediction.RF)),
boosted_tree = as.numeric(unlist(pred.BT)),
Naive_Bayes = as.numeric(unlist(train_prediction.NB)),
LDA = as.numeric(unlist(LDA_Prediction$posterior[,2])),
QDA = as.numeric(unlist(QDA_Prediction$posterior[,2]))
) %>% # note: response must be a factor
# convert to long format
gather(model, prediction, enet, logistic,knn,svm, random_forest,boosted_tree,Naive_Bayes,LDA,QDA) %>%
# get performance for each model
group_by(model) %>%
roc_curve(truth = linked, prediction) # get ROC curve data
#: plot the ROC curve
pred_data %>%
ggplot(aes(1-specificity, sensitivity, color=model)) +
geom_path() +
geom_smooth(values=c(enet='black', logistic='brown4', knn ='cyan4',svm ='khaki3',random_forest = '#8B7B8B', boosted_tree = 'orange', Naive_BAYES = 'purple', LDA = 'palegreen2', QDA = '#7FFFD4'),size=3, span=0.8) +
scale_x_continuous(breaks=seq(0, 1, by=.2)) +
scale_y_continuous(breaks=seq(0, 1, by=.2)) + labs(title ="ROC Curves")In general, the performance of the four models is similar. As for the data I used, the proportion of positive and negative examples has reached over 1:10. Therefore, it is no longer meaningful to only look at the accuracy performance (if you keep taking samples and immediately classify them as not blue trap, you can get more than 90% accuracy classifier). Once we should pay more attention to TPR, FPR and percision. For humanitarian aid, our goal is to help as many victims as possible. When we have sufficient supplies and human resource, our philosophy is to find and treat all the victims. At this point, we need to reduce the case of False Negative generated by model prediction. This requires that the TPR of the model be high enough. At this time, it seems that the KNN model I trained is more satisfying. Its TPR is 96. Unfortunately, we didn’t have enough resources at the time when the disaster happened. Therefore, we need to help the homeless as efficiently as possible. We should avoid waste of medical resources and manpower as much as possible. At this time, we need to choose a model with a lower FPR. At this point, logistic Regression seems to be the more appropriate model. It has the smallest FPR, although it also has the smallest TPR (which means we’ll miss a lot of Blue TARP. But it avoids wasting resources. In general, KNN should be a perfect model among the models I have trained except for the two extreme cases mentioned above.
My model was trained from the haitiTrain.csv dataset. The performance of the model in this data set is far beyond my imagination. Percision is close to or above 90%. So they certainly help us quickly and accurately locate the blue TARP and deduce the location of the victims. But I don’t think this result is so superior in the real world. The first is that even if we were able to locate the Blue Tarp accurately, there would not always be homeless people living under the Blue TARP. Locating people is only the first step in helping them.
I think this data fits well with the model we chose in part1. First of all, the characteristic dimension of the sample is not high and the correlation between the three dimensions and Class is very high. Each dimension in RGB is independent of each other. Therefore, we do not need to do too much pre-processing before training the model. Secondly, these models are not very suitable for multi-classification problems, especially for Logistic regression. If we need to use these models for multiple classifications, we may need to introduce SoftMax layer or use multiple classifiers to achieve this goal, which will undoubtedly reduce accuracy. But for this problem, we only need to find the Blue TARP which means we only need to deal with binary classification. These relatively simple classifiers can do the job on their own. And the third is that color blue is special enough in all the samples. As I showed in the EDA section, their location in RGB 3d space is very concentrated and linearly separable from other colors. Therefore, no matter we use KNN(which only pay attention on their neighbors), logistic Regression model which can only deal with linear problems, or SVM classifier which Because can be used in more classification scenarios thanks to the kernal function, we can all get good results. For these data, we can even visually find a plane to separate the two types of data. So it’s definitely not too difficult for a classifier.
In the confuse matrix predicted by several of my models, more errors occur in False Negitave. This means that many of the blue sheds photographed in RBG maps have deviated from their original color. This may be due to other factors. Like the angle of the sunshine, or the weather. Maybe we can add new dimensions to the feature of data, like the weather at the time of the photo, the time of day. This might improve the performance of the model.
ADDITIONAL SECTIONS FOR PART II:
Load hold-out data, explore data, etc.
ortho057_noblue = read_table("orthovnir057_ROI_NON_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 0, file ="057")
ortho067_blue = read_table("orthovnir067_ROI_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 1, file ="067")
ortho067_noblue = read_table("orthovnir067_ROI_NOT_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 0, file ="067")
ortho069_noblue = read_table("orthovnir069_ROI_NOT_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 0, file ="069")
ortho069_blue = read_table("orthovnir069_ROI_Blue_Tarps.txt",skip = 8, col_names = FALSE) %>% dplyr::select(Red = X8,Green = X9,Blue = X10) %>% mutate(bluetarp = 1, file ="069")
hold_out_data = {}
hold_out_data = rbind(ortho057_noblue,ortho067_blue)
hold_out_data = rbind(hold_out_data,ortho067_noblue)
hold_out_data = rbind(hold_out_data,ortho069_blue)
hold_out_data = hold_out_data[sample(1:nrow(hold_out_data)), ]head(hold_out_data,n=50)#> # A tibble: 50 × 5
#> Red Green Blue bluetarp file
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 255 197 132 0 057
#> 2 117 96 69 0 057
#> 3 232 202 157 0 057
#> 4 109 88 65 0 057
#> 5 101 87 59 0 057
#> 6 121 102 69 0 057
#> 7 184 161 122 0 067
#> 8 134 99 69 0 057
#> 9 98 94 59 0 057
#> 10 103 80 56 0 057
#> # … with 40 more rows
#-- The ratio of bluetarp and non-bluetarp
ggplot(hold_out_data) +
geom_bar(mapping = aes(x = bluetarp))plotData <- hold_out_data[1:65000,]
plot_ly(x=plotData$Red, y=plotData$Green, z=plotData$Blue, type="scatter3d", mode="markers", color=plotData$bluetarp) %>% layout(autosize = F, width = 500, height = 500)plot_ly(x=data$Red, y=data$Green, z=data$Blue, type="scatter3d", mode="markers", color=data$bluetarp) %>% layout(autosize = F, width = 500, height = 500)It seems that the training data and hold out data has the same distribution. So we can expect that trained model still work # Results (Hold-Out)
For KNN we need to train a new model with the Optimal K.
#knn
hold_out_model.knn <- FNN::knn(train = scale(hold_out_data[c("Red","Green","Blue")]),
test = scale(hold_out_data[c("Red","Green","Blue")]),
cl = hold_out_data$bluetarp,
k = 7,
prob = TRUE)
hold_out_knn.cm = table(hold_out_model.knn, hold_out_data$bluetarp)
print(hold_out_knn.cm)#>
#> hold_out_model.knn 0 1
#> 0 1284305 224
#> 1 184 11050
plot(hold_out_knn.cm)CM = hold_out_knn.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_knn.acc = (TP + TN)/(FP + FN + TP + TN)
hold_knn.pre = TP/(TP + FP)
hold_knn.tpr = TP/(TP+FN)
hold_knn.fpr = FP/(FP + TN)
hold_knn.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_out_model.knn))#
Hold_LR.predict = predict(LR_model,newx = as.matrix(hold_out_data[c("Red","Green","Blue")]), type="response")
hold_train_prediction.LR= ifelse(Hold_LR.predict >= LR.optThreshold, 1L, 0L)
hold_out_lr.cm = table(hold_train_prediction.LR, hold_out_data$bluetarp)
print(hold_out_lr.cm)#>
#> hold_train_prediction.LR 0 1
#> 0 1284470 5827
#> 1 19 5447
plot(hold_out_lr.cm)CM = hold_out_lr.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_lr.acc = (TP + TN)/(FP + FN + TP + TN)
hold_lr.pre = TP/(TP + FP)
hold_lr.tpr = TP/(TP+FN)
hold_lr.fpr = FP/(FP + TN)
hold_lr.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.LR))#
Hold_Enet.predict <-predict(Enet_Model,newx = as.matrix(hold_out_data[c("Red","Green","Blue")]), type="response")
hold_train_prediction.Enet = ifelse(Hold_Enet.predict >= Enet.optThreshold, 1L, 0L)
hold_out_enet.cm = table(hold_train_prediction.Enet, hold_out_data$bluetarp)
print(hold_out_enet.cm)#>
#> hold_train_prediction.Enet 0 1
#> 0 1280203 113
#> 1 4286 11161
CM = hold_out_enet.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_Enet.acc = (TP + TN)/(FP + FN + TP + TN)
hold_Enet.pre = TP/(TP + FP)
hold_Enet.tpr = TP/(TP+FN)
hold_Enet.fpr = FP/(FP + TN)
hold_Enet.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.Enet))#
Hold_pred = predict(fit, hold_out_data, probability = TRUE) %>%
attr("probabilities") %>% .[,"1"]#> Error in predict.glmnet(object$glmnet.fit, newx, s = lambda, ...): The number of variables in newx must be 3
hold_train_prediction.SVM = ifelse(Hold_pred > .50, 1, 0) # hard classification#> Error in ifelse(Hold_pred > 0.5, 1, 0): 找不到对象'Hold_pred'
hold_out_svm.cm = table(hold_train_prediction.SVM, hold_out_data$bluetarp)#> Error in table(hold_train_prediction.SVM, hold_out_data$bluetarp): 找不到对象'hold_train_prediction.SVM'
print(hold_out_svm.cm)#> Error in h(simpleError(msg, call)): 在为'print'函数选择方法时评估'x'参数出了错: 找不到对象'hold_out_svm.cm'
plot(hold_out_svm.cm)#> Error in plot(hold_out_svm.cm): 找不到对象'hold_out_svm.cm'
CM = hold_out_svm.cm#> Error in eval(expr, envir, enclos): 找不到对象'hold_out_svm.cm'
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_svm.acc = (TP + TN)/(FP + FN + TP + TN)
hold_svm.pre = TP/(TP + FP)
hold_svm.tpr = TP/(TP+FN)
hold_svm.fpr = FP/(FP + TN)
hold_svm.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.SVM))#> Error in roc.default(response, predictor, auc = TRUE, ...): 找不到对象'hold_train_prediction.SVM'
hold_train_prediction.NB = predict(Naive_Bayes_Model,newdata= hold_out_data[c("Red","Green","Blue")])
hold_out_nb.cm = table(hold_train_prediction.NB, hold_out_data$bluetarp)
print(hold_out_nb.cm)#>
#> hold_train_prediction.NB 0 1
#> 0 1283856 8446
#> 1 633 2828
plot(hold_out_nb.cm)CM = hold_out_nb.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_nb.acc = (TP + TN)/(FP + FN + TP + TN)
hold_nb.pre = TP/(TP + FP)
hold_nb.tpr = TP/(TP+FN)
hold_nb.fpr = FP/(FP + TN)
hold_nb.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.NB))hold_LDA_Prediction = predict(LDA_Model,newdata= hold_out_data[c("Red","Green","Blue")])
hold_train_prediction.LDA <- hold_LDA_Prediction$class
hold_out_lda.cm = table(hold_train_prediction.LDA, hold_out_data$bluetarp)
print(hold_out_lda.cm)#>
#> hold_train_prediction.LDA 0 1
#> 0 1280100 2093
#> 1 4389 9181
plot(hold_out_lda.cm)CM = hold_out_lda.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_lda.acc = (TP + TN)/(FP + FN + TP + TN)
hold_lda.pre = TP/(TP + FP)
hold_lda.tpr = TP/(TP+FN)
hold_lda.fpr = FP/(FP + TN)
hold_lda.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.LDA))hold_QDA_Prediction = predict(QDA_Model,newdata= hold_out_data[c("Red","Green","Blue")])
hold_train_prediction.QDA<- hold_QDA_Prediction$class
hold_out_qda.cm = table(hold_train_prediction.QDA, hold_out_data$bluetarp)
print(hold_out_qda.cm)#>
#> hold_train_prediction.QDA 0 1
#> 0 1284099 3202
#> 1 390 8072
plot(hold_out_qda.cm)CM = hold_out_qda.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_qda.acc = (TP + TN)/(FP + FN + TP + TN)
hold_qda.pre = TP/(TP + FP)
hold_qda.tpr = TP/(TP+FN)
hold_qda.fpr = FP/(FP + TN)
hold_qda.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.QDA))
#hold_pred.BT = predict(Boost_Model,hold_out_data, type="response")
hold_train_prediction.BT = ifelse(hold_pred.BT >= Boost_opt_threshold, 1L, 0L)
hold_out_bt.cm = table(hold_train_prediction.BT, hold_out_data$bluetarp)
print(hold_out_bt.cm)#>
#> hold_train_prediction.BT 0 1
#> 0 1278995 751
#> 1 5494 10523
plot(hold_out_bt.cm)CM = hold_out_bt.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_bt.acc = (TP + TN)/(FP + FN + TP + TN)
hold_bt.pre = TP/(TP + FP)
hold_bt.tpr = TP/(TP+FN)
hold_bt.fpr = FP/(FP + TN)
hold_bt.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.BT))
##
hold_train_prediction.RF = predict(RF_Model,newdata = hold_out_data[c("Red","Green","Blue")])
hold_out_rf.cm = table(hold_train_prediction.RF, hold_out_data$bluetarp)
print(hold_out_rf.cm)#>
#> hold_train_prediction.RF 0 1
#> 0 1282220 1880
#> 1 2269 9394
plot(hold_out_rf.cm)CM = hold_out_rf.cm
FP = CM[1,2]
FN = CM[2,1]
TP = CM[2,2]
TN = CM[1,1]
#get the accuracy precision FPR and FPR
hold_rf.acc = (TP + TN)/(FP + FN + TP + TN)
hold_rf.pre = TP/(TP + FP)
hold_rf.tpr = TP/(TP+FN)
hold_rf.fpr = FP/(FP + TN)
hold_rf.auc = auc(as.numeric(hold_out_data$bluetarp), as.numeric(hold_train_prediction.RF))| Model | AUROC | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|
| Random Forest | 0.887 | 0.996 | 0.786 | 0.0019 | 0.774 |
| Boosted Tree | 0.961 | 0.995 | 0.680 | 0.0006 | 0.925 |
| SVM | 0.785 | 0.993 | 0.638 | 0.0037 | 0.571 |
| Naive Bayes | 0.625 | 0.993 | 0.817 | 0.0065 | 0.253 |
| LDA | 0.905 | 0.995 | 0.6765 | 0.0016 | 0.814 |
| QDA | 0.858 | 0.997 | 0.9534 | 0.0024 | 0.715 |
| Penalized Log Reg | 0.993 | 0.996 | 0.722 | 0.000008 | 0.9899 |
| Logistic Regression | 0.742 | 0.995 | 0.996 | 0.004 | 0.483 |
| KNN | 0.99 | 0.999 | 0.983 | 0.00017 | 0.980 |
Conclusions for part2: ## Conclusion #1 It seems that the SVM or KNN model performs the best in CV performance. SVM has more degrees of freedom in the training process and the characteristics of SVM make it less prone to overfitting. In hold-out data, because bluetrap accounts for a smaller proportion of all data, accuracy becomes less important. The overall performance of KNN is better in this case.It’s AUROC,percision are both also to 1.
In the test of hold-out data. The FPR of KNN is only 0.0017, which means we have very little discipline to misjudge a location as a bluetarp. This is very important for humanitarian relief. Because our supplies are often insufficient during rescue. Locating refugee locations more efficiently and accurately means we can deploy supplies more efficiently without being disturbed by false results. Apart from that, other parameters of this model also performed well. Compared to Penalized Log Reg. He has a higher TPR. This means that it is also not possible for us to miss a lot of refugees.
The choice of model also depends on the specific situation. For example, when we have sufficient supplies and rescue forces, we want to avoid the FN situation. At this time Penalized Logistic Regression is still the best choice. When our supplies and rescue forces are insufficient, we should try to avoid the FP situation. At this time Logistic Regression is a very extreme choice. Although its overall performance is poor, it only classified 19 points that are not bluetarp as bluetarp.
Since the data is uneven accuracy is arguably the most useless data in the table. In contrast, AUROC can better represent the performance of the model. It basically aggregates the performance of the model at all threshold values. These metrics are also mutually exclusive in certain cases. For example, when the precision of our model is very high, it means that FP is a very small number. However for FPR(is equal to FP/(FP +FN). When we have a very small FP FPR will have a very large number less chance. TPR and percision also have a certain correlation. In some extreme cases (FN = TN) and the denominators of their expressions are both TP. At this point the values of the two metrics will be very close.
In this project, we can draw a conclusion that sometimes accuracy is not very meaningful. Over 90% of the data we use is not bluetarp. Under such data, even if we predict all the data as not bluetarp, our model accuracy can still be higher than 95%. But that completely defeats the reason why we are building a model to make predictions. We should define the loss function at this time to reevaluate the performance of our model. In addition, we can also customize the weight of the loss function to make the model more aggressive in a certain direction. ### Conclusion #6 The best threshold depends on the objective of the model. The threshold is not a fixed number (I used to think it could always be set to 0.5). Through this project I think we should change the threshold size according to our needs. After defining the loss function, we can find the most suitable threshold for our application scenario by traversing all the possible thresholds.
Data mining is definitely not an easy task. At first I thought that finding bluetarp using RGB data would be a particularly easy task. Because blue is less likely to appear on land, the correlation between blue and bluetrap should be very high. Then the RGB data we used corresponds exactly to blue. It can be said that it will not be a difficult thing to build a model and classify with such a high correlation feature. But in the project we still need a lot of parameters in various models. These parameters also have a large impact on the model. Whether the data is balanced, the loss function we define, and what parameters are used to express the performance of the model. These are all factors we need to consider. Not to mention that in other problems we also need to do a lot of preprocessing on the data. For example, we need to drop some features that are irrelevant to prediction, or delete some repetitive features. So data mining is more than just feeding data into a function and getting a model.